The goal is to find out why the Procrastinate Pro+ entertainment app's profits have been falling over the last few months and to help the company return to profit growth.
Despite heavy investment in advertising, the company has been operating at a loss for the past few months. The task is to identify the reasons for the decline in profit and help the company grow it again.
The data covers users who registered between May 1 and October 27, 2019:
server log with data about their visits,
logs of their purchases for this period,
advertising expenses.
We need to find out:
where users come from and what devices they use,
how much user acquisition costs for each advertising channel,
how much money each client brings in,
when the cost of acquiring a client pays off,
what factors hinder customer acquisition.
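The payback question above comes down to comparing per-user revenue (LTV) with per-user acquisition cost (CAC). A minimal sketch of that logic with made-up numbers (all figures hypothetical, not from the datasets):

```python
# Toy illustration of the payback logic; all numbers are made up.
ad_spend = 500.0   # spend on one channel for one day
new_users = 200    # users acquired that day
cac = ad_spend / new_users         # cost of acquiring one user
revenue_14d = 620.0                # revenue those users brought within 14 days
ltv_14d = revenue_14d / new_users  # average revenue per acquired user
roi_14d = ltv_14d / cac            # advertising pays off when ROI >= 1
```

Later sections compute exactly these quantities per channel, region and device over a 14-day horizon.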
visits_info_short.csv — server log with information about site visits;
orders_info_short.csv — information about purchases;
costs_info_short.csv — information about advertising expenses.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as ticker
import numpy as np
from scipy import stats as st
from datetime import datetime, timedelta
try:
    visits = pd.read_csv('visits_info_short.csv')
    orders = pd.read_csv('orders_info_short.csv')
    costs = pd.read_csv('costs_info_short.csv')
except FileNotFoundError:
    visits = pd.read_csv('datasets/visits_info_short.csv')
    orders = pd.read_csv('datasets/orders_info_short.csv')
    costs = pd.read_csv('datasets/costs_info_short.csv')
# Function to get primary information about the dataframe
def first_look(df):
    print('------------- First 5 lines ------------')
    display(df.sample(5))
    print('------------- Data types ------------')
    display(df.info())
    print('------------- Gaps ------------')
    for element in df.columns:
        if df[element].isna().sum() > 0:
            print(element, '-', df[element].isna().sum())
        else:
            print(element, '- None')
    print('------------- Duplicates ------------')
    if df.duplicated().sum() > 0:
        print(df.duplicated().sum())
    else:
        print('No Duplicates')
first_look(costs)
------------- First 5 lines ------------
| | dt | Channel | costs |
|---|---|---|---|
| 559 | 2019-05-20 | TipTop | 110.20 |
| 87 | 2019-07-27 | FaceBoom | 204.60 |
| 943 | 2019-06-13 | AdNonSense | 9.45 |
| 378 | 2019-05-19 | RocketSuperAds | 24.57 |
| 104 | 2019-08-13 | FaceBoom | 161.70 |
------------- Data types ------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1800 entries, 0 to 1799
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   dt       1800 non-null   object
 1   Channel  1800 non-null   object
 2   costs    1800 non-null   float64
dtypes: float64(1), object(2)
memory usage: 42.3+ KB
None
------------- Gaps ------------
dt - None
Channel - None
costs - None
------------- Duplicates ------------
No Duplicates
The data is clean, with no missing values or duplicates.
The dt column will be converted to the datetime format.
Let's bring the column names to lower case.
costs['dt'] = pd.to_datetime(costs['dt'], format='%Y-%m-%d').dt.date
costs.columns = costs.columns.str.lower()
display(costs.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1800 entries, 0 to 1799
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   dt       1800 non-null   object
 1   channel  1800 non-null   object
 2   costs    1800 non-null   float64
dtypes: float64(1), object(2)
memory usage: 42.3+ KB
None
first_look(orders)
------------- First 5 lines ------------
| | User Id | Event Dt | Revenue |
|---|---|---|---|
| 30163 | 878272987023 | 2019-06-29 19:54:07 | 4.99 |
| 16019 | 731110616083 | 2019-09-03 14:17:25 | 4.99 |
| 10161 | 318964015484 | 2019-07-31 12:57:01 | 4.99 |
| 11422 | 798778681086 | 2019-08-08 06:14:14 | 4.99 |
| 9576 | 204634485980 | 2019-07-27 08:52:55 | 4.99 |
------------- Data types ------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40212 entries, 0 to 40211
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   User Id   40212 non-null  int64
 1   Event Dt  40212 non-null  object
 2   Revenue   40212 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 942.6+ KB
None
------------- Gaps ------------
User Id - None
Event Dt - None
Revenue - None
------------- Duplicates ------------
No Duplicates
The data is clean, with no missing values or duplicates.
Let's bring the column names to lower case.
The Event Dt column will be converted to the datetime format.
orders.rename(columns={'Event Dt':'event_dt', 'User Id':'user_id'}, inplace=True)
orders.columns = orders.columns.str.lower()
orders['event_dt'] = pd.to_datetime(orders['event_dt'], format='%Y-%m-%d %H:%M:%S')
orders.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40212 entries, 0 to 40211
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   user_id   40212 non-null  int64
 1   event_dt  40212 non-null  datetime64[ns]
 2   revenue   40212 non-null  float64
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 942.6 KB
first_look(visits)
------------- First 5 lines ------------
| | User Id | Region | Device | Channel | Session Start | Session End |
|---|---|---|---|---|---|---|
| 28343 | 633742538705 | United States | Android | TipTop | 2019-05-29 04:43:12 | 2019-05-29 05:07:29 |
| 280812 | 909677423232 | UK | PC | OppleCreativeMedia | 2019-09-17 00:57:29 | 2019-09-17 01:16:35 |
| 81126 | 327739919627 | United States | iPhone | organic | 2019-07-20 07:31:15 | 2019-07-20 07:46:03 |
| 192707 | 760253425950 | United States | Android | organic | 2019-10-19 03:49:59 | 2019-10-19 03:58:05 |
| 161940 | 605394818892 | United States | Mac | organic | 2019-09-27 13:30:44 | 2019-09-27 13:42:47 |
------------- Data types ------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309901 entries, 0 to 309900
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   User Id        309901 non-null  int64
 1   Region         309901 non-null  object
 2   Device         309901 non-null  object
 3   Channel        309901 non-null  object
 4   Session Start  309901 non-null  object
 5   Session End    309901 non-null  object
dtypes: int64(1), object(5)
memory usage: 14.2+ MB
None
------------- Gaps ------------
User Id - None
Region - None
Device - None
Channel - None
Session Start - None
Session End - None
------------- Duplicates ------------
No Duplicates
The data is clean, with no missing values or duplicates.
Let's bring the column names to lower case.
The Session Start and Session End columns will be converted to the datetime format.
visits.rename(columns={'Session Start':'Session_Start', 'Session End':'Session_End', 'User Id':'user_id'}, inplace=True)
visits.columns = visits.columns.str.lower()
visits.sample()
| | user_id | region | device | channel | session_start | session_end |
|---|---|---|---|---|---|---|
| 223605 | 67003681792 | France | PC | organic | 2019-06-02 00:55:32 | 2019-06-02 00:59:25 |
visits['session_start'] = pd.to_datetime(visits['session_start'], format='%Y-%m-%d %H:%M:%S')
visits['session_end'] = pd.to_datetime(visits['session_end'], format='%Y-%m-%d %H:%M:%S')
visits.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309901 entries, 0 to 309900
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   user_id        309901 non-null  int64
 1   region         309901 non-null  object
 2   device         309901 non-null  object
 3   channel        309901 non-null  object
 4   session_start  309901 non-null  datetime64[ns]
 5   session_end    309901 non-null  datetime64[ns]
dtypes: datetime64[ns](2), int64(1), object(3)
memory usage: 14.2+ MB
display(visits['region'].value_counts())
display(visits['device'].value_counts())
display(visits['channel'].value_counts())
United States    207327
UK                36419
France            35396
Germany           30759
Name: region, dtype: int64
iPhone     112603
Android     72590
PC          62686
Mac         62022
Name: device, dtype: int64
organic               107760
TipTop                 54794
FaceBoom               49022
WahooNetBanner         20465
LeapBob                17013
OppleCreativeMedia     16794
RocketSuperAds         12724
YRabbit                 9053
MediaTornado            8878
AdNonSense              6891
lambdaMediaAds          6507
Name: channel, dtype: int64
Results of the preview of the Costs, Orders and Visits datasets:
No missing values or duplicates were found in any dataset.
Column names were converted to lower case.
Date columns were converted to the datetime format.
def get_profiles(visits, orders, costs):
    # find the parameters of each user's first visit
    profiles = (
        visits.sort_values(by=['user_id', 'session_start'])
        .groupby('user_id')
        .agg(
            {
                'session_start': 'first',
                'channel': 'first',
                'device': 'first',
                'region': 'first',
            }
        )
        .rename(columns={'session_start': 'first_ts'})
        .reset_index()
    )
    # for cohort analysis, keep the date of the first visit
    # and the first day of its month
    profiles['dt'] = profiles['first_ts'].dt.date
    profiles['month'] = profiles['first_ts'].astype('datetime64[M]')
    # add a paying-user flag
    profiles['payer'] = profiles['user_id'].isin(orders['user_id'].unique())
    # count the number of unique users
    # with the same source and date of acquisition
    new_users = (
        profiles.groupby(['dt', 'channel'])
        .agg({'user_id': 'nunique'})
        .rename(columns={'user_id': 'unique_users'})
        .reset_index()
    )
    # combine advertising spend with the number of acquired users
    ad_costs = costs.merge(new_users, on=['dt', 'channel'], how='left')
    # divide advertising expenses by the number of acquired users
    ad_costs['acquisition_cost'] = ad_costs['costs'] / ad_costs['unique_users']
    # add the acquisition cost to the profiles
    profiles = profiles.merge(
        ad_costs[['dt', 'channel', 'acquisition_cost']],
        on=['dt', 'channel'],
        how='left',
    )
    # the acquisition cost of organic users is zero
    profiles['acquisition_cost'] = profiles['acquisition_cost'].fillna(0)
    return profiles
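The core of get_profiles is the per-day, per-channel acquisition cost. A minimal synthetic check of that step (toy frames with made-up numbers, not the project data):

```python
import pandas as pd

# made-up spend and cohort sizes for a single acquisition day
costs_toy = pd.DataFrame({
    'dt': ['2019-05-01', '2019-05-01'],
    'channel': ['TipTop', 'FaceBoom'],
    'costs': [100.0, 50.0],
})
new_users_toy = pd.DataFrame({
    'dt': ['2019-05-01', '2019-05-01', '2019-05-01'],
    'channel': ['TipTop', 'FaceBoom', 'organic'],
    'unique_users': [40, 25, 10],
})
ad_costs = costs_toy.merge(new_users_toy, on=['dt', 'channel'], how='left')
ad_costs['acquisition_cost'] = ad_costs['costs'] / ad_costs['unique_users']
# organic has no spend row, so after the left merge into profiles its
# acquisition_cost would be NaN and is filled with 0, as in get_profiles
```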
def get_retention(
    profiles,
    sessions,
    observation_date,
    horizon_days,
    dimensions=[],
    ignore_horizon=False,
):
    dimensions = ['payer'] + dimensions
    last_suitable_acquisition_date = observation_date
    if not ignore_horizon:
        last_suitable_acquisition_date = observation_date - timedelta(
            days=horizon_days - 1
        )
    result_raw = profiles.query('dt <= @last_suitable_acquisition_date')
    result_raw = result_raw.merge(
        sessions[['user_id', 'session_start']], on='user_id', how='left'
    )
    result_raw['lifetime'] = (
        result_raw['session_start'] - result_raw['first_ts']
    ).dt.days

    def group_by_dimensions(df, dims, horizon_days):
        result = df.pivot_table(
            index=dims, columns='lifetime', values='user_id', aggfunc='nunique'
        )
        cohort_sizes = (
            df.groupby(dims)
            .agg({'user_id': 'nunique'})
            .rename(columns={'user_id': 'cohort_size'})
        )
        result = cohort_sizes.merge(result, on=dims, how='left').fillna(0)
        result = result.div(result['cohort_size'], axis=0)
        result = result[['cohort_size'] + list(range(horizon_days))]
        result['cohort_size'] = cohort_sizes
        return result

    result_grouped = group_by_dimensions(result_raw, dimensions, horizon_days)
    result_in_time = group_by_dimensions(
        result_raw, dimensions + ['dt'], horizon_days
    )
    return result_raw, result_grouped, result_in_time
def get_conversion(
    profiles,
    orders,
    observation_date,
    horizon_days,
    dimensions=[],
    ignore_horizon=False,
):
    last_suitable_acquisition_date = observation_date
    if not ignore_horizon:
        last_suitable_acquisition_date = observation_date - timedelta(
            days=horizon_days - 1
        )
    result_raw = profiles.query('dt <= @last_suitable_acquisition_date')
    first_purchases = (
        orders.sort_values(by=['user_id', 'event_dt'])
        .groupby('user_id')
        .agg({'event_dt': 'first'})
        .reset_index()
    )
    result_raw = result_raw.merge(
        first_purchases[['user_id', 'event_dt']], on='user_id', how='left'
    )
    result_raw['lifetime'] = (
        result_raw['event_dt'] - result_raw['first_ts']
    ).dt.days
    if len(dimensions) == 0:
        result_raw['cohort'] = 'All users'
        dimensions = dimensions + ['cohort']

    def group_by_dimensions(df, dims, horizon_days):
        result = df.pivot_table(
            index=dims, columns='lifetime', values='user_id', aggfunc='nunique'
        )
        result = result.fillna(0).cumsum(axis=1)
        cohort_sizes = (
            df.groupby(dims)
            .agg({'user_id': 'nunique'})
            .rename(columns={'user_id': 'cohort_size'})
        )
        result = cohort_sizes.merge(result, on=dims, how='left').fillna(0)
        result = result.div(result['cohort_size'], axis=0)
        result = result[['cohort_size'] + list(range(horizon_days))]
        result['cohort_size'] = cohort_sizes
        return result

    result_grouped = group_by_dimensions(result_raw, dimensions, horizon_days)
    if 'cohort' in dimensions:
        dimensions = []
    result_in_time = group_by_dimensions(
        result_raw, dimensions + ['dt'], horizon_days
    )
    return result_raw, result_grouped, result_in_time
def get_ltv(
    profiles,
    orders,
    observation_date,
    horizon_days,
    dimensions=[],
    ignore_horizon=False,
):
    last_suitable_acquisition_date = observation_date
    if not ignore_horizon:
        last_suitable_acquisition_date = observation_date - timedelta(
            days=horizon_days - 1
        )
    result_raw = profiles.query('dt <= @last_suitable_acquisition_date')
    result_raw = result_raw.merge(
        orders[['user_id', 'event_dt', 'revenue']], on='user_id', how='left'
    )
    result_raw['lifetime'] = (
        result_raw['event_dt'] - result_raw['first_ts']
    ).dt.days
    if len(dimensions) == 0:
        result_raw['cohort'] = 'All users'
        dimensions = dimensions + ['cohort']

    def group_by_dimensions(df, dims, horizon_days):
        result = df.pivot_table(
            index=dims, columns='lifetime', values='revenue', aggfunc='sum'
        )
        result = result.fillna(0).cumsum(axis=1)
        cohort_sizes = (
            df.groupby(dims)
            .agg({'user_id': 'nunique'})
            .rename(columns={'user_id': 'cohort_size'})
        )
        result = cohort_sizes.merge(result, on=dims, how='left').fillna(0)
        result = result.div(result['cohort_size'], axis=0)
        result = result[['cohort_size'] + list(range(horizon_days))]
        result['cohort_size'] = cohort_sizes
        cac = df[['user_id', 'acquisition_cost'] + dims].drop_duplicates()
        cac = (
            cac.groupby(dims)
            .agg({'acquisition_cost': 'mean'})
            .rename(columns={'acquisition_cost': 'cac'})
        )
        roi = result.div(cac['cac'], axis=0)
        roi = roi[~roi['cohort_size'].isin([np.inf])]
        roi['cohort_size'] = cohort_sizes
        roi['cac'] = cac['cac']
        roi = roi[['cohort_size', 'cac'] + list(range(horizon_days))]
        return result, roi

    result_grouped, roi_grouped = group_by_dimensions(
        result_raw, dimensions, horizon_days
    )
    if 'cohort' in dimensions:
        dimensions = []
    result_in_time, roi_in_time = group_by_dimensions(
        result_raw, dimensions + ['dt'], horizon_days
    )
    return (
        result_raw,
        result_grouped,
        result_in_time,
        roi_grouped,
        roi_in_time,
    )
def filter_data(df, window):
    # smooth each column with a rolling mean
    for column in df.columns.values:
        df[column] = df[column].rolling(window).mean()
    return df
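filter_data smooths every column with a rolling mean before plotting. A quick self-contained check of its behaviour on a synthetic series (window of 3; the helper is redefined here only so the snippet runs standalone):

```python
import pandas as pd

def filter_data(df, window):
    # rolling mean over each column; the first window-1 rows become NaN
    for column in df.columns.values:
        df[column] = df[column].rolling(window).mean()
    return df

demo = pd.DataFrame({'metric': [1.0, 2.0, 3.0, 4.0, 5.0]})
smoothed = filter_data(demo, window=3)
# the first two values are NaN, then the 3-point averages 2.0, 3.0, 4.0
```

Note that the helper mutates its argument in place, which is why the plotting functions always pass it a freshly built pivot table.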
def plot_retention(retention, retention_history, horizon, window=7):
    plt.figure(figsize=(15, 10))
    retention = retention.drop(columns=['cohort_size'])
    retention_history = retention_history.drop(columns=['cohort_size'])[
        [horizon - 1]
    ]
    if retention.index.nlevels == 1:
        retention['cohort'] = 'All users'
        retention = retention.reset_index().set_index(['cohort', 'payer'])
    ax1 = plt.subplot(2, 2, 1)
    retention.query('payer == True').droplevel('payer').T.plot(
        grid=True, ax=ax1
    )
    plt.legend()
    plt.xlabel('Lifetime')
    plt.title('Retention of paying users')
    ax2 = plt.subplot(2, 2, 2, sharey=ax1)
    retention.query('payer == False').droplevel('payer').T.plot(
        grid=True, ax=ax2
    )
    plt.legend()
    plt.xlabel('Lifetime')
    plt.title('Retention of non-paying users')
    ax3 = plt.subplot(2, 2, 3)
    columns = [
        name
        for name in retention_history.index.names
        if name not in ['dt', 'payer']
    ]
    filtered_data = retention_history.query('payer == True').pivot_table(
        index='dt', columns=columns, values=horizon - 1, aggfunc='mean'
    )
    filter_data(filtered_data, window).plot(grid=True, ax=ax3)
    plt.xlabel('Date of customer acquisition')
    plt.title('Dynamics of retention of paying users on the {}th day'.format(horizon))
    ax4 = plt.subplot(2, 2, 4, sharey=ax3)
    filtered_data = retention_history.query('payer == False').pivot_table(
        index='dt', columns=columns, values=horizon - 1, aggfunc='mean'
    )
    filter_data(filtered_data, window).plot(grid=True, ax=ax4)
    plt.xlabel('Date of customer acquisition')
    plt.title('Dynamics of retention of non-paying users on the {}th day'.format(horizon))
    plt.tight_layout()
    plt.show()
def plot_conversion(conversion, conversion_history, horizon, window=7):
    plt.figure(figsize=(15, 5))
    conversion = conversion.drop(columns=['cohort_size'])
    conversion_history = conversion_history.drop(columns=['cohort_size'])[
        [horizon - 1]
    ]
    ax1 = plt.subplot(1, 2, 1)
    conversion.T.plot(grid=True, ax=ax1)
    plt.legend()
    plt.xlabel('Lifetime')
    plt.title('User conversion')
    ax2 = plt.subplot(1, 2, 2, sharey=ax1)
    columns = [
        name for name in conversion_history.index.names if name not in ['dt']
    ]
    filtered_data = conversion_history.pivot_table(
        index='dt', columns=columns, values=horizon - 1, aggfunc='mean'
    )
    filter_data(filtered_data, window).plot(grid=True, ax=ax2)
    plt.xlabel('Date of customer acquisition')
    plt.title('Dynamics of user conversion on the {}th day'.format(horizon))
    plt.tight_layout()
    plt.show()
def plot_ltv_roi(ltv, ltv_history, roi, roi_history, horizon, window=14):
    plt.figure(figsize=(20, 15))
    ltv = ltv.drop(columns=['cohort_size'])
    ltv_history = ltv_history.drop(columns=['cohort_size'])[[horizon - 1]]
    cac_history = roi_history[['cac']]
    roi = roi.drop(columns=['cohort_size', 'cac'])
    roi_history = roi_history.drop(columns=['cohort_size', 'cac'])[
        [horizon - 1]
    ]
    ax1 = plt.subplot(2, 3, 1)
    ltv.T.plot(grid=True, ax=ax1)
    plt.legend()
    plt.xlabel('Lifetime')
    plt.title('LTV')
    ax2 = plt.subplot(2, 3, 2, sharey=ax1)
    columns = [name for name in ltv_history.index.names if name not in ['dt']]
    filtered_data = ltv_history.pivot_table(
        index='dt', columns=columns, values=horizon - 1, aggfunc='mean'
    )
    filter_data(filtered_data, window).plot(grid=True, ax=ax2)
    plt.xlabel('Date of customer acquisition')
    plt.title('Dynamics of LTV users on the {}th day'.format(horizon))
    ax3 = plt.subplot(2, 3, 3, sharey=ax1)
    columns = [name for name in cac_history.index.names if name not in ['dt']]
    filtered_data = cac_history.pivot_table(
        index='dt', columns=columns, values='cac', aggfunc='mean'
    )
    filter_data(filtered_data, window).plot(grid=True, ax=ax3)
    plt.xlabel('Date of customer acquisition')
    plt.title('Dynamics of the cost of customer acquisition')
    ax4 = plt.subplot(2, 3, 4)
    roi.T.plot(grid=True, ax=ax4)
    plt.axhline(y=1, color='red', linestyle='--', label='Payback level')
    plt.legend()
    plt.xlabel('Lifetime')
    plt.title('ROI')
    ax5 = plt.subplot(2, 3, 5, sharey=ax4)
    columns = [name for name in roi_history.index.names if name not in ['dt']]
    filtered_data = roi_history.pivot_table(
        index='dt', columns=columns, values=horizon - 1, aggfunc='mean'
    )
    filter_data(filtered_data, window).plot(grid=True, ax=ax5)
    plt.axhline(y=1, color='red', linestyle='--', label='Payback level')
    plt.xlabel('Date of customer acquisition')
    plt.title('Dynamics of user ROI on the {}th day'.format(horizon))
    plt.tight_layout()
    plt.show()
profiles = get_profiles(visits, orders, costs)
display(profiles.tail(5))
| | user_id | first_ts | channel | device | region | dt | month | payer | acquisition_cost |
|---|---|---|---|---|---|---|---|---|---|
| 150003 | 999956196527 | 2019-09-28 08:33:02 | TipTop | iPhone | United States | 2019-09-28 | 2019-09-01 | False | 3.500000 |
| 150004 | 999975439887 | 2019-10-21 00:35:17 | organic | PC | UK | 2019-10-21 | 2019-10-01 | False | 0.000000 |
| 150005 | 999976332130 | 2019-07-23 02:57:06 | TipTop | iPhone | United States | 2019-07-23 | 2019-07-01 | False | 2.600000 |
| 150006 | 999979924135 | 2019-09-28 21:28:09 | MediaTornado | PC | United States | 2019-09-28 | 2019-09-01 | False | 0.205714 |
| 150007 | 999999563947 | 2019-10-18 19:57:25 | organic | iPhone | United States | 2019-10-18 | 2019-10-01 | False | 0.000000 |
observation_date = datetime(2019, 11, 1).date()
horizon_days = 14
min_time = profiles['first_ts'].min().date()
max_time = profiles['first_ts'].max().date()
display(f'Minimum user acquisition date - {min_time}')
display(f'Maximum user acquisition date - {max_time}')
'Minimum user acquisition date - 2019-05-01'
'Maximum user acquisition date - 2019-10-27'
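With horizon_days = 14, only cohorts with a full 14 days of history before the observation date are analysed. A sketch of the cutoff that get_retention, get_conversion and get_ltv apply internally:

```python
from datetime import datetime, timedelta

observation_date = datetime(2019, 11, 1).date()
horizon_days = 14
# a cohort needs horizon_days of history, so the last usable acquisition
# date is horizon_days - 1 days before the observation date
last_suitable_acquisition_date = observation_date - timedelta(days=horizon_days - 1)
```

Users acquired after 2019-10-19 are therefore excluded from the 14-day cohort metrics.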
(profiles
.pivot_table(index='region',
values=['acquisition_cost','user_id','payer'],
aggfunc={'acquisition_cost':'sum','user_id':'count','payer':'mean'})
.sort_values(by='payer', ascending=False))
| acquisition_cost | payer | user_id | |
|---|---|---|---|
| region | |||
| United States | 90928.600000 | 0.069019 | 100002 |
| Germany | 4424.223235 | 0.041119 | 14981 |
| UK | 5068.502819 | 0.039829 | 17575 |
| France | 5075.973946 | 0.037994 | 17450 |
(profiles
.pivot_table(index='device',
values=['acquisition_cost','user_id','payer'],
aggfunc={'acquisition_cost':'sum','user_id':'count','payer':'mean'})
.sort_values(by='payer', ascending=False))
| acquisition_cost | payer | user_id | |
|---|---|---|---|
| device | |||
| Mac | 24094.537878 | 0.063644 | 30042 |
| iPhone | 43678.788251 | 0.062079 | 54479 |
| Android | 22460.629159 | 0.058518 | 35032 |
| PC | 15263.344712 | 0.050468 | 30455 |
(profiles
.pivot_table(index='channel',
values=['acquisition_cost','user_id','payer'],
aggfunc={'acquisition_cost':'sum','user_id':'count','payer':'mean'})
.sort_values(by='payer', ascending=False))
| acquisition_cost | payer | user_id | |
|---|---|---|---|
| channel | |||
| FaceBoom | 32445.60 | 0.122049 | 29144 |
| AdNonSense | 3911.25 | 0.113402 | 3880 |
| lambdaMediaAds | 1557.60 | 0.104700 | 2149 |
| TipTop | 54751.30 | 0.096007 | 19561 |
| RocketSuperAds | 1833.00 | 0.079137 | 4448 |
| WahooNetBanner | 5151.00 | 0.052964 | 8553 |
| YRabbit | 944.22 | 0.038265 | 4312 |
| MediaTornado | 954.48 | 0.035747 | 4364 |
| LeapBob | 1797.60 | 0.030633 | 8553 |
| OppleCreativeMedia | 2151.25 | 0.027077 | 8605 |
| organic | 0.00 | 0.020553 | 56439 |
Advertising costs in the US (about 91 thousand c.u.) are roughly 6 times higher than in France, Germany and the UK combined.
Advertising costs in France, Germany and the UK are about the same - around 5 thousand c.u. each.
At the same time, the number of US users is only about 2 times larger than that of France, Germany and the UK combined.
The US also has the highest share of paying users - about 6.9%, compared to roughly 4% in the other countries.
iPhone users cost more to acquire (about 44 thousand c.u.) than Mac and Android users (about 23 thousand c.u. each) or PC users (about 15 thousand c.u.).
Paying users are distributed fairly evenly across platforms - around 5-6%.
Mac leads in conversion at 6.4%; PC users convert the least, at 5%.
Spending on TipTop and FaceBoom is more than 4 times higher than on all other channels combined.
More users were acquired through TipTop and FaceBoom than through all other paid channels combined.
The notable leaders in conversion to paying users are FaceBoom, AdNonSense, lambdaMediaAds and TipTop.
Interestingly, FaceBoom's conversion rate (12.2%) is higher than TipTop's (9.6%).
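The payer shares in the tables above are just the mean of the boolean payer flag. A tiny synthetic check of the same pivot_table call (made-up rows, not the project data):

```python
import pandas as pd

toy = pd.DataFrame({
    'region': ['US', 'US', 'US', 'UK'],
    'payer': [True, False, False, False],
    'acquisition_cost': [2.0, 3.0, 1.0, 0.5],
    'user_id': [1, 2, 3, 4],
})
summary = toy.pivot_table(
    index='region',
    values=['acquisition_cost', 'user_id', 'payer'],
    aggfunc={'acquisition_cost': 'sum', 'user_id': 'count', 'payer': 'mean'})
# US row: total cost 6.0, payer share 1/3, 3 users
```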
display(f'Total marketing expenses = {costs["costs"].sum():.2f}')
t1 = costs.groupby('channel').sum().sort_values(by='costs', ascending=False)
t1['ratio'] = round(t1['costs']/t1['costs'].sum(), 3)
display(t1)
'Total marketing expenses = 105497.30'
| costs | ratio | |
|---|---|---|
| channel | ||
| TipTop | 54751.30 | 0.519 |
| FaceBoom | 32445.60 | 0.308 |
| WahooNetBanner | 5151.00 | 0.049 |
| AdNonSense | 3911.25 | 0.037 |
| OppleCreativeMedia | 2151.25 | 0.020 |
| RocketSuperAds | 1833.00 | 0.017 |
| LeapBob | 1797.60 | 0.017 |
| lambdaMediaAds | 1557.60 | 0.015 |
| MediaTornado | 954.48 | 0.009 |
| YRabbit | 944.22 | 0.009 |
costs.pivot_table(
index='dt', columns='channel', values='costs', aggfunc='mean'
).plot(grid=True, figsize=(16,8))
plt.xlabel('Date of purchase')
plt.title('Dynamics of marketing expenses');
t1 = (profiles
.query('channel != "organic"')
.pivot_table(index='channel',
values=['acquisition_cost','user_id'],
aggfunc={'acquisition_cost':'mean','user_id':'count'})
.sort_values(by='acquisition_cost', ascending=False))
mean_cac = profiles.query('channel != "organic"')['acquisition_cost'].mean()
display(f'Average CAC per user {mean_cac:.3f}');
display(t1)
'Average CAC per user 1.127'
| acquisition_cost | user_id | |
|---|---|---|
| channel | ||
| TipTop | 2.799003 | 19561 |
| FaceBoom | 1.113286 | 29144 |
| AdNonSense | 1.008054 | 3880 |
| lambdaMediaAds | 0.724802 | 2149 |
| WahooNetBanner | 0.602245 | 8553 |
| RocketSuperAds | 0.412095 | 4448 |
| OppleCreativeMedia | 0.250000 | 8605 |
| YRabbit | 0.218975 | 4312 |
| MediaTornado | 0.218717 | 4364 |
| LeapBob | 0.210172 | 8553 |
# Exclude organic users from the sample
profiles = profiles.query('channel != "organic"')
ltv_raw, ltv_grouped, ltv_history, roi_grouped, roi_history = get_ltv(
profiles, orders, observation_date, horizon_days, dimensions=[], ignore_horizon=False)
plot_ltv_roi(ltv_grouped, ltv_history, roi_grouped, roi_history, horizon_days,window=14)
dimensions = ['channel']
ltv_raw, ltv_grouped, ltv_history, roi_grouped, roi_history = get_ltv(
profiles, orders, observation_date, horizon_days, dimensions=dimensions, ignore_horizon=False)
plot_ltv_roi(ltv_grouped, ltv_history, roi_grouped, roi_history, horizon_days,window=14)
dimensions = ['region']
ltv_raw, ltv_grouped, ltv_history, roi_grouped, roi_history = get_ltv(
profiles, orders, observation_date, horizon_days, dimensions=dimensions, ignore_horizon=False)
plot_ltv_roi(ltv_grouped, ltv_history, roi_grouped, roi_history, horizon_days,window=14)
profiles_US = profiles.query('region == "United States"')
dimensions = ['channel']
ltv_raw, ltv_grouped, ltv_history, roi_grouped, roi_history = get_ltv(
profiles_US, orders, observation_date, horizon_days, dimensions=dimensions, ignore_horizon=False)
plot_ltv_roi(ltv_grouped, ltv_history, roi_grouped, roi_history, horizon_days,window=14)
profiles_NUS = profiles.query('region != "United States"')
dimensions = ['channel']
ltv_raw, ltv_grouped, ltv_history, roi_grouped, roi_history = get_ltv(
profiles_NUS, orders, observation_date, horizon_days, dimensions=dimensions, ignore_horizon=False)
plot_ltv_roi(ltv_grouped, ltv_history, roi_grouped, roi_history, horizon_days,window=14)
dimensions = ['device']
conversion, conversion_grouped, conversion_history = get_conversion(
profiles, orders, observation_date, horizon_days, dimensions=dimensions, ignore_horizon=False)
plot_conversion(conversion_grouped, conversion_history, horizon_days, window=14)
retention, retention_grouped, retention_history = get_retention(
profiles, visits, observation_date, horizon_days, dimensions=dimensions)
plot_retention(retention_grouped, retention_history, horizon_days, window=14)
dimensions = ['region']
conversion, conversion_grouped, conversion_history = get_conversion(
profiles, orders, observation_date, horizon_days, dimensions=dimensions, ignore_horizon=False)
plot_conversion(conversion_grouped, conversion_history, horizon_days, window=14)
retention, retention_grouped, retention_history = get_retention(
profiles, visits, observation_date, horizon_days, dimensions=dimensions)
plot_retention(retention_grouped, retention_history, horizon_days, window=14)
dimensions = ['channel']
conversion, conversion_grouped, conversion_history = get_conversion(
profiles, orders, observation_date, horizon_days, dimensions=dimensions, ignore_horizon=False)
plot_conversion(conversion_grouped, conversion_history, horizon_days, window=14)
retention, retention_grouped, retention_history = get_retention(
profiles, visits, observation_date, horizon_days, dimensions=dimensions)
plot_retention(retention_grouped, retention_history, horizon_days, window=14)
# Let's take a closer look at the users of the FaceBoom channel
profiles_FB = profiles.query('channel == "FaceBoom"')
dimensions = ['device']
ltv_raw, ltv_grouped, ltv_history, roi_grouped, roi_history = get_ltv(
profiles_FB, orders, observation_date, horizon_days, dimensions=dimensions, ignore_horizon=False)
plot_ltv_roi(ltv_grouped, ltv_history, roi_grouped, roi_history, horizon_days,window=14)
profiles_AS = profiles.query('channel == "AdNonSense"')
dimensions = ['device']
ltv_raw, ltv_grouped, ltv_history, roi_grouped, roi_history = get_ltv(
profiles_AS, orders, observation_date, horizon_days, dimensions=dimensions, ignore_horizon=False)
plot_ltv_roi(ltv_grouped, ltv_history, roi_grouped, roi_history, horizon_days,window=14)
profiles_TT = profiles.query('channel == "TipTop"')
dimensions = ['device']
ltv_raw, ltv_grouped, ltv_history, roi_grouped, roi_history = get_ltv(
profiles_TT, orders, observation_date, horizon_days, dimensions=dimensions, ignore_horizon=False)
plot_ltv_roi(ltv_grouped, ltv_history, roi_grouped, roi_history, horizon_days,window=14)
FaceBoom appears to be the worst-performing and at the same time quite expensive advertising channel. It should be abandoned.
TipTop has shown an abnormal price increase - a multi-fold jump every month; perhaps the channel has hit the peak of its popularity. TipTop delivers good users, but because of the rising cost, spending on them does not pay off.
Cost reduction should be discussed; the volume of purchases could possibly be increased using part of the FaceBoom budget.
This tells us that the US has very responsive users, but there is a failure in retention, which results in a sharp increase in CAC. The technical side of retention should be checked for errors at all stages of the user's life cycle.
Based on the marketing metric charts in 8.3.1-2, we can recommend channels more promising than the problematic FaceBoom and the expensive TipTop:
United States:
UK, Germany, France